Mitochondrial DNA variation, genetic structure and demographic history of Iranian populations.

In order to survey the evolutionary history and impact of historical events on the genetic structure of Iranian people, the HV2 region of 141 mtDNA sequences related to six Iranian populations were analyzed. Slight and non-significant FST distances among the Central-western Persian speaking populations of Iran testify to the common origin of these populations from one proto-population. Mismatch distribution suggests that this proto-Iranian population started to colonize Iran about 30000 years ago which is almost consistent with the timing of arrival and colonization of western Asia by the anatomically modern human. Star-like haplotype network structures, significant and negative Tajima's D (D=-2.08, P<0.05) and unimodal mismatch distributions support the genetic effects of this expansion. Iranian populations presented mtDNA lineages that clearly belong to the European gene pool (i.e. H and U), while the Mashhad population was characterized by the presence of eastern and central Asian mtDNA lineages (i.e. M, B and D). Furthermore, the low diversity (h=0.428) observed in Mashhad may indicated the presence of inbreeding, drift or bottleneck events. The application of Monmonier's maximum differences algorithm revealed a geographic zone of genetic discontinuity between the Arab people of Khuzestan and rest of Iranian populations. Geographical factors, in cooperation with cultural/linguistic differences, are the main reasons for this differentiation. The lack of a sharp geographical or ethno-linguistic structure for mtDNA HV2 sequence diversity was statistically supported by AMOVA and Mantel (r=0.19, P<0.05) tests.


INTRODUCTION
Present-day Iran has played a key role in the distribution of the modern human, and has acted as a corridor and natural inter-continental passageway for the expansion of genes [1]. The Neolithic and Metal Ages seem to be the time windows that left the MBRC http://mbrc.shirazu.ac. ir 47 populations? (3) Is there any specific linguistic or geographic structure governing the mtDNA diversity? (4) Can a source population for the Persian speaking populations from Iran be identified? (5) Is there any genomic boundaries between the Iranian populations? and (6) Which of the evolutionary forces been involved in shaping the genetic landscape of present-day Iranians. The comprehension of this specific case study can help clarify the genetic structure and origin of Iranian populations.

MATERIALS AND METHODS
Study populations: HV2 nucleotide sequences of 141 individuals from six Iranian populations including five Persian speaking populations from Tehran [32], Esfahan, Yazd, Shiraz, Mashhad, and one Arab population from Khuzestan province (GenBank Accession Numbers EU239536 to EU239655) were obtained from the GeneBank database (Table 1). For inter-population comparison purposes and estimate the effects of neighboring ethnic group's mtDNA pools on the Iranians genetic landscape, sets of the HV2 nucleotide sequences were obtained from three regional groups of populations (Fig. 1). These were populations from (1) Central Asia [33] including Kazakhs, Kirgizes, Tajiks, Turkmens, Afghans and Russians; (2) Pakistan including Baloches, Brahuis, Burushos, Hazaras, Kalashes, Makranis, Pathans and Sindhis (GenBank Accession Numbers EU565766 to EU566829); and (3) the Anatolia/Caucasus region including Armenians, Georgians, Azeris, Turks [32], Iraqi Kurds [34] and Adygei people of the Caucasus region.  Sequence alignment: Sequence alignment was first performed using the ClustalW procedure implemented in Mega, version 5.2, and then by hand [35].
Statistical analysis: Basic parameters of molecular diversity such as the number of haplotypes (H), the number of polymorphic sites (s), the mean number of nucleotide differences (k) [36], and nucleotide (π) and haplotype (h) diversity [37] were calculated for each population using Arlequin package version 3.5 [38]. Mega version 5.2, was used to align HV2 sequences to the revised Cambridge reference sequence (rCRS) [39] and detect the polymorphic sites. mtDNA haplogroups were determined based on diagnostic sites in the HV2 region following the mtDNA tree Build 15 http://mbrc.shirazu.ac.ir 49 (http://www.phylotree.org/) [40]. Evolutionary relationships of the observed mtDNA haplotypes were displayed by a phylogenetic method known as NeighborNet [41] using the SplitsTree version 4 software package [42]. Thus, SplitsTree was employed to build a split network depicting the proximity among haplotypes in a non-dichotomous fashion, with the uncorrected P, NeighborNet distance and Equal Angle algorithm methods (default options). The advantage of this type of cluster analysis is that it allows the cycles or reticulations within evolutionary pathways to accommodate the elevated mutation rates and the corresponding homoplasy of particular genetic systems [43][44][45]. The best probabilistic model of sequence evolution was determined using the software JModeltest version 2.1.3 [46] and the Akaike information criterion. Pairwise F ST genetic distance values were calculated based on the number of pairwise differences between sequences and the K2P model of nucleotide substitutions. The statistical significance of pairwise F ST genetic distances was estimated by permutation analysis using 10000 random permutations. These values were used to evaluate the genetic differentiation of different populations. A neighbor-joining tree [47] was built from the F ST distance matrix. The distance matrix was also represented by non-linear multidimensional scaling (NM-MDS) using the STATISTICA 10 package (StatSoft Inc.) [48].
Changes of effective population size through time were examined following two different approaches; 1) a neutrality test against population growth and 2) the distribution of pairwise differences (mismatch distribution or MMD). First, potential departures from a null hypothesis of the mutation-drift equilibrium and constant population size were estimated by computing the Tajima's D test for selective neutrality [49]. Thus, negative values of Tajima's D statistic could reveal recent demographic expansions. Second, we analyzed the distribution of all pairwise haplotype differences and calculated the goodness-of-fit of the estimated distribution to that predicted by a sudden expansion model using 1000 computer simulations [50]. Mismatch distributions were graphically displayed in Microsoft Excel 2007. Mismatch distributions tend to be unimodal, and smooth (i.e. wave-like) in populations that have undergone population size changes. Multimodal or random and rough distributions are characteristics of populations that have experienced long-term stability [51,52]. The significance or goodness-of-fit of the observed data to the predicted distribution modeled for sudden expansion growth was assessed by using a sum of squares (SSD) method and raggedness index (rg) [53,54]. Significant differences in the sum of the square deviations (P SSD <0.05) and raggedness index (P rg <0.05) between the observed and simulated mismatch distributions indicated that the population was at a mutation-drift equilibrium (i.e. in a non-expansion phase) [51,52]. When observed distributions fit the sudden expansion model (P SSD ≥0.05) using Arlequin version 3.5, the time since the beginning of the expansion (t) was estimated from the peak of the distribution (i.e. τ) as t = τ/2μ [55], where μ is the rate of mutation per site per million years multiplied by sequence length.
Based on a Delaunay triangulation connectivity network, Monmonier's maximumdifference algorithm [56][57][58] was used to identify genetic boundaries, namely, geographic zones where differences between populations were largest. The algorithm MBRC http://mbrc.shirazu.ac.ir 50 was applied using the Barrier 2.2 program [59]. To identify groups of neighboring populations with maximum genetic differentiation, algorithmic analysis of molecular variance (AMOVA) was applied to the groups classified according to their geographic and linguistic affiliation. This test calculates fixation indices (i.e. Φ ST , Φ SC and Φ CT ) [60], analogous to Wright's F-statistics [61], allowing the researchers to investigate hierarchical population structure by differentiating variation between groups versus variation within each group. Significance levels of genetic variance components as well as Φ values were evaluated by using 1000 permutations. Eventually, the statistical significance of the correlation between geographic and F ST genetic distance matrices was evaluated by the Mantel test [62] with 1000 permutations using the R vegan library [63]. The Geographic Distance Matrix Generator software, version 1.2, was used to make a geographic distance matrix [64].

RESULTS
Genetic diversity: Using 294 bp long sequences comprising nucleotide positions 48 to 342 of the mtDNA control region, we recognized 125 haplotypes in 1694 individuals, which 34 of them observed in the Iranians (Table 1). In addition, 22 out of 34 haplotypes (64.7%) were singletons and only 12 (35.29%) were shared between Iranians. Haplotypes no.3 and no.1 showed the highest frequencies in Iranians (in 56 and 23 individuals, respectively). The unrooted SplitsTree NeighbourNet network in Figure 2 provides a graphic representation of the groups of haplotypes which is not purely dichotomous. Reticulation indicates alternative mutational pathways (i.e. homoplasy) that occur mostly inside each group, as is often the case with D-loop sequences. This allowed us to assign each Iranian haplotype to one of the haplogroups identified. Some parameters characterizing within-population diversity of the mtDNA sequences, such as sample size (n), number of polymorphic sites (s), number of haplotypes (H), haplotype diversity (h), nucleotide diversity (π), and mean number of pairwise differences (k) are listed in Table 1. Global haplotype diversity was found to be 0.673, ranging from 0.932 for Arab people of Khuzestan province, to 0.428 for Mashhad. Other Iranian populations presented sequence diversities of 0.854 (Tehran) to 0.792 (Yazd). The Central Asian populations presented sequence diversities of 0.718 (Russians) to 0.504 (Kyrgizes). Pakistanians exhibited mtDNA sequence diversities ranging from 0.756 (Kalashes and Baluches) to 0.466 (Hazaras); and the sequence diversity range in the Caucasus region was found to be from 0.8 (Turks) to 0.51 (Azeris). Nucleotide diversity ranged from 0.017 for Arab people of Khuzestan province, to 0.003 for those from Hazara and 0.004 for the Iraqi Kurds (Table 2). In addition, the high and low diversities observed in Khuzestan province and Mashhad were clearly evident in high and low levels of the mean number of pairwise differences (5.25 and 2.047, respectively). The low diversity observed in Mashhad and the Iraqi Kurds may testify to the presence of evolutionary forces such as inbreeding, drift or bottleneck events.   Haplogroup definition: Each mtDNA haplotype was assigned to a particular mitochondrial haplogroup on the basis of mtDNA TreeBuild, version 15, of the PhyloTree. The frequency distribution of the mtDNA haplogroups was inferred from data on HV2 nucleotide sequences. Data on the distribution of mtDNA haplogroups in all Iranian populations under study are summarized in Table 2. Figure 3 depicts the geographic distribution of the observed mtDNA haplogroups. The mtDNA pools of all Iranian populations were characterized by the presence of European mtDNA haplogroups H and U (at the frequencies of 27.65% and 12.76%, respectively). Other common haplogroups observed were T (11.34%) and J1 (9.92%). Apparently, these mtDNA haplogroups are hardly suitable for studying inter-population relationships and only testify to the common origin of Iranian populations from one proto-population. The distribution of rare or unique haplogroups in populations is more informative. Approximately 58% of all common mtDNA haplogroups (11 from 19 haplogroups) were relatively rare, occurring in 1-2 out of the 6 populations under study. Two singleton haplogroups F and D were observed in Mashhad while, another two singleton haplogroups P and G were observed in the Tehran population. A Mongoloid component observed in Mashhad with the frequency of about 28.57% was represented by haplogroups M, B, and D. Thus, the data on mtDNA polymorphism indicated pronounced differentiation between western-central and eastern Iranians. Eastern Iranians (i.e. the Mashhad population) were characterized by an mtDNA pool composition similar to that of eastern and central Asia while western-central Iranian populations were characterized by an mtDNA pool composition similar to that of Europe and eastern Asia. Demographic analysis: Mismatch distribution analyses (MMD) were carried out using goodness-of-fit tests based on sum of squared deviations and raggedness index (Table 3). When applied to the pooled data set of Iran, mismatch distribution was unable to reject the model of sudden expansion (P (sim ≥ obs) > 0.05) (Fig. 4G). Pooling differentiated samples, however, entail some bias; therefore we conducted the analysis population by population (Fig. 4). Mismatch curves of HV2 haplotypes were smooth and unimodal in almost all examined Iranian populations, with the exception of Mashhad (P SSD <0.05). Tajima's D was strongly negative and significantly different from zero for the pooled data set (D=-2.08, p<0.05) and almost all single Iranian ethnic groups with the exception of Esfahan (D=-1.18, P>0.05) and Arabs from Khuzestan province (D=-1.38, P>0.05) ( Table 1). Thus, tajima's D estimates were further confirmation for the recent expansions, which reflect an excess of singletons and lowfrequency variants in the surveyed mtDNA pools resulted from recent demographic expansions. It should be noted, however, that other factors including background selection and mutation rate heterogeneity can account for deviations in these statistics [65,66]. All these clues support the expansion model for Iran, which implies an excess of recently diverged haplotypes and a deficit of deeper coalescence events.
From the observed distribution of pairwise differences (MMD), it is possible to estimate the parameters of the theoretical model (i.e. τ) proposed by Rogers and Harpending (1992) or its simplified version [54] (Table 3). In addition, change in the effective population size can be estimated by calculating two successive values of the θ includes θ 0 =2N 0 µ and θ 1 =2N 1 µ, where N 0 and N 1 indicate the effective population size in the past and present, respectively, and μ denotes the mutation rate of the human mtDNA D-Loop region [50] (Table 3). Time elapsed since the beginning of expansion  was estimated from the equation t = τ/2µ, where t is the time since expansion and µ is the per nucleotide mutation rate multiplied by the sequence length. Estimates of the time elapsed since the beginning of expansion for the pooled data set and each Iranian population (with the exception of Mashhad) are given in Table 3. With a mutation rate of 32 % / site/ Myr [67], the τ-value of 5.66 obtained by MMD on the pooled data set is translated into an expansion time of about 30080 (55909-2338) years ago, while that of the Khuzestan province sample alone is 33343 (53084-10483). The time elapsed since the beginning of expansion for other Iranian populations ranged from 21497 (Shiraz) to 28294 (Esfahan) years ago.
Differentiation among populations: Pairwise F ST genetic distances between populations were calculated and their statistical significance was estimated by 10000 permutations ( Table 4). The pairwise F ST estimates between the Persian speaking populations including Tehran, Esfahan, Shiraz and Yazd were insignificantly low, suggesting little genetic differentiation between population pairs, which could be possibly attributable to high gene flow or the recent common origin of Persian speaking populations. In addition, Iranians, with the exception of those from Mashhad, showed high levels of differentiation with the Central Asian groups. The people from Mashhad showed lower levels of differentiation with the Azeris (-0.011) from Azerbaijan, Central Asian Kazakhs (-0.0069), Kirgizes (-0.0066) and Pathans (-0.006) from northern Pakistan. Based on the pairwise F ST genetic distances, previous study shows that the Central Asian mtDNA sequences presented features between those of the Europeans and eastern Asians [68]. Several hypotheses could explain this intermediate position, but the most plausible would involve extensive levels of admixture between Europeans and eastern Asians in Central Asia.  On the population level, phylogenetic tree were constructed from an F ST genetic distance matrix for HV2 sequence data using the Neighbor-Joining (NJ) algorithm (Fig.  5). This clustering approach was used because it does not assume an evolutionary clock (i.e. the tree is unrooted) and produces more accurate results when closely related populations, such as human groups are analyzed [47]. Towards the top of the tree, Central Asia populations, with the exception of Russians and northwestern populations of Pakistan including Hazaras and Makranis, were placed in a single cluster, and the cluster to the left of the tree includes the remaining Iranians from Tehran, Yazd, Shiraz, Esfahan and Khuzestan province, and Russians from Caucasus and Central Asia regions ( Figure 5).
Of the Iranian populations, Mashhad is positioned closest to eastern populations, namely from Central Asia and Pakistan. A tree representation of genetic distances may be misread as a succession of evolutionary splits which is inappropriate for populations below the species level, therefore multidimensional scaling was performed on the F ST distance matrix for HV2 sequence data, in order to provide a visual representation of the genetic relationships in two and three-dimensional space ( Figure 6). Central-western Persian speaking populations of Iran cluster together, with the dispersion being mostly in the middle of the MDS plot. As for the NJ tree, the Mashhad population was the closest to the eastern communities including those of Central Asia and Pakistan, whereas the Arab population from Khuzestan province was positioned far from other populations. Population structure: The lack of a sharp geographic or ethno-linguistic structure for mtDNA HV2 sequence diversity was statistically supported by different tests. According to AMOVA results, when population samples were subdivided based on either spoken language (Iranian, Indo-Aryan, Altaic, Semitic, Armenian, Karto-Zan and Abkhazo-Adyghean language groups) or geography (pattern 1: the western (the Caucasians), the Central (the Iranians), and the eastern (the Pakistanian and Central Asian) groups; pattern 2: the western (the Caucasians), the Central (the Iranians), the southeastern (the Pakistanians), and the northeastern (the Central Asians) groups), the among-groups component of variance (i.e. Φ CT ) was always low ( Table 5). The majority of haplotype variation was indeed significantly accounted for within population differences (95.44-96.13%, p<0.001). In order to further investigate the patterns of genetic variation in geographic space, the Mantel test was used to measure the correlation between geographic and Pairwise F ST genetic distance matrices. The results showed a low but significant positive correlation between the genetic and geographical distance matrices (r=0.19, P<0.05), indicating that the levels of genetic resemblance between populations is weakly dependent on geographic distances.
The zones of sharpest HV2 changes or putative genetic boundaries were identified using Monmonier's algorithm based on F ST genetic distances. Localities were connected according to adjacency criteria, thus defining a so-called Delaunay triangulation (Fig.  7). The calculated genetic distances between populations were connected by single B) MBRC http://mbrc.shirazu.ac.ir 59 edges of the network. From the edges associated with the highest genetic distances, an arbitrary number of lines of maximum genetic differentiation, or genetic boundaries, were traced. The significance of each identified boundary was eventually tested by AMOVA which compares the samples on either side of that boundary. The genetic boundary inferred from HV2-based distances between populations is shown in Figure 7.
A genetic barrier with the maximum degree of genetic differentiation was located between the Arab people of Khuzestan province and other Iranians. Geographical factors (i.e. residence in the southern parts of the Zagros Mountains) or limited genetic exchange due to cultural/linguistic differences are the main reasons of this differentiation.

DISCUSSION
The Iranian populations of this study come from different regions and speak different languages. The populations of Esfahan, Shiraz and Khuzestan reside in geographical proximity, but speak different languages, whereas those from Mashhad, Esfahan and Shiraz speak the same language (i.e. Persian) [69] despite being located in different geographic areas. Non-significant F ST genetic distances for mtDNA data indicate close genetic relationships among the four Iranian populations (i.e. Tehran, Yazd, Esfahan, and Shiraz), while Mashhad and Khuzestan differ from these and from each other. The high genetic similarity of these populations suggests the existence of a single proto-Iranian population in the distant past from which greatly diverse modern Iranian ethnic groups (i.e. the Persian speaking populations) originated. The major mtDNA lineages in Iranians all exhibit a star-like network (Fig. 2) structure and unimodal mismatch distributions, which suggest the genetic effects of population expansion. The distribution of nucleotide differences between haplotypes suggests that this proto-Iranian population started to colonize Iran about 30080 years ago, which is approximately consistent with the timing of arrival and colonization of west Asia by the anatomically modern human [70]. However, additional research such as cultural practices, isotope chemistry, and mtDNA haplogroup frequencies for archaeological specimens needs to be carried out in order to confirm this scenario. Clearly, the discrepancy in mutation rates has important implications for reconstructions of human evolutionary history based on mtDNA variation, and thus caution is required when interpreting the results. The differences between estimated times regarding the beginning of colonization in west Asia by the anatomically modern human are due to the use of different mutation rates for the human mtDNA D-Loop region [67].
The lack of a sharp geographical or ethno-linguistic structure for mtDNA HV2 sequence diversity was statistically supported by the AMOVA test. Additionally, the Mantel test revealed a small but significant correlation (r=0.19, P<0.05) indicating that the level of genetic resemblance between populations was slightly dependent on geographic distance [62]. These results suggest that the common origin feature of the Indo-Iranian populations is obscured by the effects of gene flow from neighboring non-Indo-European populations.
Furthermore, five populations from Iran including Tehran, Yazd, Esfahan, Shiraz and Mashhad have the same linguistic origin; nevertheless, the Mashhad population is genetically different from the others, being characterized by an mtDNA pool composition similar to that of eastern and Central Asia (i.e. high frequency of haplogroups M, B and D) while the western-central Iranian populations were characterized by an mtDNA pool composition similar to that of Europe and western Asia, a finding consistent with previous studies [71]. According to pairwise F ST genetic distances, the Mashhad population showed a close genetic relationship with the Central Asian populations; suggesting the presence of a gene flow from these populations. Central Asian mtDNA sequences presented features between European and eastern Asian sequences [68] in several parameters such as frequencies of certain nucleotides, MBRC http://mbrc.shirazu.ac.ir 61 levels of nucleotide diversity, mean number of pairwise differences, and pairwise F ST genetic distances. Several hypotheses could explain the intermediate position of Central Asia between Europe and the eastern Asia, but the most plausible would involve extensive levels of admixture between the Europeans and eastern Asians in Central Asia, possibly enhanced during the Silk Road trade and clearly after eastern and the western Eurasian human groups had diverged [72]. In addition, the lowest diversity among the Iranians, belonged to the Mashhad population (h= 0.428), which indicates the effect of evolutionary forces such as inbreeding, drift or bottleneck events during the evolution of this population. Furthermore, Yazd, Shiraz, Esfahan and Khuzestan province populations reside in geographical proximity, but the Arab population of Khuzestan is genetically distant from the others, reflecting their different origin. The Persian language belongs to the Iranian branch of the Indo-European family, whereas the Arabic language spoken in parts of Khuzestan province belongs to the Semitic family [69]. In the present study, using Monmonier's maximum differences algorithm [57] based on the F ST genetic distances, we showed differentiation among the Arab people of Khuzestan province and other Iranians. Geography [73] and cultural/linguistic differences due to inbreeding or genetic exchange limited to other Arabic speaking populations in the south west of Iran, [74,75] can be the main reasons of this differentiation.